Extracting Structured Knowledge for Semantic Web by Mining Wikipedia

Author

  • Kotaro Nakayama
Abstract

Wikipedia has become a huge-scale database storing a wide range of human knowledge, which makes it a promising corpus for knowledge extraction. A considerable amount of research on Wikipedia mining has been conducted, and it has confirmed that Wikipedia is an invaluable corpus. Wikipedia’s impressive characteristics are not limited to its scale; they also include its dense link structure, URIs for word sense disambiguation, well-structured Infoboxes, and the category tree. One popular approach in Wikipedia mining is to use Wikipedia’s category tree as an ontology, and a number of researchers have shown, with significant results, that Wikipedia’s categories are promising resources for ontology construction. In this work, we try to demonstrate the capability of Wikipedia as a corpus for knowledge extraction and how it works in the Semantic Web environment. We present two achievements: Wikipedia Thesaurus, a huge-scale association thesaurus built by mining Wikipedia’s link structure, and Wikipedia Ontology, a Web ontology extracted by mining Wikipedia articles.

1. WIKIPEDIA THESAURUS

WikiRelate [3] is one of the pioneers in this research area. The algorithm finds the shortest path between the categories to which the concepts belong in a category tree. As a method for measuring the relatedness of two given concepts, it works well. However, it cannot feasibly extract all related terms for all concepts, because that would require searching all combinations of category pairs for all concept pairs (2 million × 2 million). Therefore, in our previous research, we proposed pfibf (Path Frequency Inversed Backward Link Frequency), a scalable association thesaurus construction method that measures relatedness among concepts in Wikipedia. The basic strategy of pfibf is quite simple. The relativity between two articles vi and vj is assumed to be strongly affected by the following two factors:

  • the number of paths from article vi to vj,
  • the length of each path from article vi to vj.
Footnote: The method name was lfibf in the past and was changed to pfibf.

Figure 1: Wikipedia Thesaurus Visualization

The relativity is strong if there are many paths (that is, many shared intermediate articles) between two articles. In addition, the relativity is affected by the path length. In other words, if the articles are placed closely together in the graph of the Web site, the relativity is estimated to be higher than that of farther ones. Therefore, by using all paths from vi to vj given as T = {t1, t2, ..., tn}, the relativity pf (Path Frequency) between them is defined as follows:
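The excerpt ends before the definition itself, so the formula is not reproduced here. As a rough illustration of the two factors only (more paths raise the score, longer paths contribute less), the following Python sketch scores a pair of articles in a toy link graph. It is not the authors' pf definition: the function names, the `max_len` cutoff, the `1 / length` weighting, and the example graph are all assumptions made for this example.

```python
def simple_paths(graph, source, target, max_len):
    """Yield each path from source to target that repeats no node
    and uses at most max_len links."""
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == target and len(path) > 1:
            yield path
            continue
        if len(path) - 1 >= max_len:
            continue  # path already uses max_len links; do not extend it
        for nxt in graph.get(node, ()):
            if nxt not in path:
                stack.append((nxt, path + [nxt]))


def path_score(graph, source, target, max_len=3,
               weight=lambda length: 1.0 / length):
    """Sum a length-dependent weight over all bounded simple paths.

    The 1/length weight is an assumed stand-in, chosen only so that
    shorter paths count more than longer ones.
    """
    return sum(weight(len(p) - 1)
               for p in simple_paths(graph, source, target, max_len))


# Hypothetical link graph: article -> articles it links to.
links = {
    "A": ["B", "C", "D"],
    "B": ["D"],
    "C": ["D"],
    "D": [],
    "E": ["F"],
    "F": ["G"],
    "G": ["D"],
}

score_a_d = path_score(links, "A", "D")  # three short paths
score_e_d = path_score(links, "E", "D")  # one long path
```

In this toy graph, A reaches D through one direct link and two two-link paths, while E reaches D only through a single three-link path, so A scores higher than E with respect to D. Enumerating paths this way is exponential in general, which is why the sketch bounds the path length.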

Similar resources

Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...

Computing Semantic Relatedness using DBPedia

Extracting the semantic relatedness of terms is an important topic in several areas, including data mining, information retrieval and web recommendation. This paper presents an approach for computing the semantic relatedness of terms using the knowledge base of DBpedia — a community effort to extract structured information from Wikipedia. Several approaches to extract semantic relatedness from ...

Query Wikification: Mining Structured Queries From Unstructured Information Needs using Wikipedia-based Semantic Analysis

Combining the language model and inference network, as implemented in the Indri search engine, is an efficient and verified approach. In this retrieval model, the user’s information need is expressed in Indri’s Structural Query Language. Although the SQL allows expert users to richly represent their information needs, the complexity of SQLs unfortunately makes them unpopular on the Web for ordina...

Information Extraction from Wikipedia Using Pattern Learning

In this paper we present solutions for the crucial task of extracting structured information from massive free-text resources, such as Wikipedia, for the sake of semantic databases serving upcoming Semantic Web technologies. We demonstrate both a verb frame-based approach using deep natural language processing techniques with extraction patterns developed by human knowledge experts and machine ...

Wikipedia Link Structure and Text Mining for Semantic Relation Extraction

Wikipedia, a collaborative Wiki-based encyclopedia, has become a huge phenomenon among Internet users. It covers a huge number of concepts in various fields such as Arts, Geography, History, Science, Sports and Games. Since it is becoming a database storing all human knowledge, Wikipedia mining is a promising approach that bridges the Semantic Web and the Social Web (a.k.a. Web 2.0). In fact, i...



Publication date: 2008